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Abstract 


The  most  efficient  known  parallel  algorithms  for  inversion  of  a  nonsingular 
nXn  matrix  A  or  solving  a  linear  system  Ax  =  b  over  the  rationals  require 
0(log  n)2  time  and  M(n)\/n  processors  (where  M(n)  is  the  number  of  processors 
required  in  order  to  multiply  two  nXn  rational  matrices  in  time  0(log  n).)  Furth¬ 
ermore,  all  known  polylog  time  algorithms  for  those  problems  are  unstable:  they 
require  the  calculations  to  be  done  with  perfect  precision;  otherwise  they  give  no 
results  at  all. 

This  paper  describes  parallel  algorithms  that  have  good  numerical  stability 
and  remain  efficient  as  n  grows  large.  In  particular,  we  describe  a  quadratically 
convergent  iterative  method  that  gives  the  inverse  (within  the  relative  precision 
2_n0<,>)  of  an  nXn  rational  matrix  A  with  condition  <  n0^)  in  0(log  n)2  time 
using  M(n)  processors.  This  is  the  optimum  processor  bound  and  a  \/n  improve¬ 
ment  of  known  processor  bounds  for  polylog  time  matrix  inversion.  It  is  the  first 
known  polylog  time  algorithm  that  is  numerically  stable.  The  algorithm  relies  on 

our  method  of  computing  an  approximate  inverse  of  A  that  involves  0(log  n) 
parallel  steps  and  n2  processors. 

Also,  we  give  a  parallel  algorithm  for  solution  of  a  linear  system  Ax  ==  b 
with  a  sparse  nXn  symmetric  positive  definite  matrix  A.  If  the  graph  G(A) 
(which  has  n  vertices  and  has  an  edge  for  each  nonzero  entry  of  A)  is  s(n)- 
separable,  then  our  algorithm  requires  only  0((log  n)(log  s(n))2)  time  and 
|  E  |  +  M(s(n))  processors.  The  algorithm  computes  a  recursive  factorization  of 
A  so  that  the  solution  of  any  other  linear  system  Ax  =  b'  with  the  same  matrix 
A  requires  only  0(log  n  log  s(n))  time  and  |  E  |  +  s(n)2  processors. 
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Abstract 

The  most  efficient  known  parallel  algorithms  for  inversion  of  a  nonsingular 

nXn  matrix  A  or  solving  a  linear  system  Ax  =  b  over  the  rationals  require 

h  ^  rjwt'r  sV/,e'  l’0‘t 

0(Iog  n)|f  time  and  M(n)\^i  processors  (where  M(n)  is  the  number  of  processors 
required  in  order  to  multiply  two  nXn  rational  matrices  in  time  0(log  n).)  Furth¬ 
ermore,  all  known  polylog  time  algorithms  for  those  problems  are  unstable:  they 
require  the  calculations  to  be  done  with  perfect  precision;  otherwise  they  give  no 
results  at  all. 

This  paper  describes  parallel  algorithms  that  have  good  numerical  stability 

and  remain  efficient  as  n  grows  large. \  In  particular,  we  describe  a  quadratically 

1 

convergent  iterative  method  that  gives  the  inverse  (within  the  relative  precision 
2"n0<")  of  an  nXn  rational  matrix  A  with  condition  <  n0^  in  0(log  n)2  time 
using  M(n)  processors.  This  is  the  optimum  processor  bound  and  a  >/n  improve¬ 
ment  of  known  processor  bounds  for  polylog  time  matrix  inversion.  It  is  the  first 
known  polylog  time  algorithm  that  is  numerically  stable.  The  algorithm  relies  on 
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our  method  of  computing  an  approximate  inverse  of  A  that  involves  0(log  n) 
parallel  steps  and  n2  processors. 

Also,  we  give  a  parallel  algorithm  for  solution  of  a  linear  system  Ax  =  b 
with  a  sparse  nXn  symmetric  positive  definite  matrix  A.  If  the  graph  G(A) 
(which  has  n  vertices  and  has  an  edge  for  each  nonzero  entry  of  A)  is  s(n)- 
separable,  then  our  algorithm  requires  only  0((log  n)(log  s(n))2)  time  and 
|  E  |  +  M(s(n))  processors.  The  algorithm  computes  a  recursive  factorization  of 
A  so  that  the  solution  of  any  other  linear  system  Ax  =  b'  with  the  same  matrix 
A  requires  only  0(log  n  log  s(n))  time  and  |  E  |  +  s(n)2  processors. 

1.  Introduction 

Recently  it  has  become  feasible  to  construct  computer  architectures  with  a 
large  number  of  processors.  We  assume  the  parallel  machine  model  of  [Borodin, 
von  zur  Gathen,  and  Hopcroft,  82j,  where  on  each  step  each  processor  can  do  a 
single  rational  addition,  subtraction,  multiplication,  or  division.  In  this  paper  we 
are  concerned  about  the  efficient  use  of  this  parallelism  for  solving  some  funda¬ 
mental  numerical  problems  such  as 

(1)  INVERT:  given  an  nXn  rational  matrix  A  =  (ajj),  then  output  A-1  within  a 
prescribed  accuracy  €  if  A  is  well-conditioned,  else  output  ill-conditioned. 

(2)  LINEAR-SOLVE:  given  a  well-conditioned  nXn  matrix  A  and  a  column  vec¬ 
tor  b  of  length  n,  find  x  =  A~'b  within  a  prescribed  accuracy  e. 

We  say  that  matrix  A  is  well-conditioned  if  cond  A  <  C  for  a  certain  param¬ 
eter  C.  We  say  that  we  have  computed  A-1  within  accuracy  e  if  the  norm  of  the 
error  matrix  divided  by  the  norm  of  A'1  does  not  exceed  f. 

We  are  interested  in  parallel  algorithms  that  have  good  numerical  stability, 
polylog  time  bounds  and  small  processor  bounds  for  INVERT  and  LINEAR- 
SOLVE.  Certainly  LINEAR-SOLVE  can  be  immediately  reduced  to  INVERT  by 
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the  multiplication  of  A-1  by  b.  Such  a  reduction  of  LINEAR-SOL VE  to  INVERT 
is  particularly  appropriate  if  several  linear  systems  Ax  =  b  with  the  same  A  and 
different  b  must  be  solved. 

We  will  present  the  algorithms  for  INVERT  and  for  LINEAR-SOLVE  that 
can  be  applied  for  any  choice  of  C  and  e.  On  the  other  hand,  the  complexity 
estimates  of  our  algorithms  depend  on  the  choice  of  C  and  e.  In  this  paper  we 
will  be  primarily  concerned  about  the  complexity  estimates  in  the  case  where 
C  =  nc,  t  =  2"“'  for  some  positive  constants  c  and  c#  ,  say  c=100,  c1  =10. 
This  covers  all  instances  of  practical  interest.  For  the  sake  of  completeness,  we 
will  ?lso  supply  the  estimates  in  the  case  of  arbitrary  C  and  £.  To  simplify  the 
estimates,  we  will  assume  that  the  arithmetic  operations  are  performed  with 
infinite  precision.  In  addition,  we  will  present  the  error  estimates  in  the  case  of 
finite  precision  computation;  this  will  demonstrate  that  our  iterative  algorithms 
are  stable,  moreover,  two  of  them  are  self-correcting. 

1.1.  Previous  Work  On  Parallel  Matrix  Inversion 

Parallel  algorithms  with  simultaneous  polylog  time  and  polynomial  processor 
bounds  are  known  to  exist  for  these  problems,  but  their  practical  utility  is  limited 
due  to  their  large  processor  bounds.  Furthermore  these  known  algorithms  have 
numerical  stability  problems;  if  the  calculations  are  not  taken  in  exact  arithmetic, 
their  outputs  may  substantially  differ  from  A-1,  so  as  not  to  constitute  an 
approximate  inverse. 

[Csanky,  76]  gave  the  first  proof  that  INVERT,  over  fields  of  characteristic 
0,  can  be  done  in  polylog  time.  The  result  was  at  that  time  surprising,  and  the 
proof  is  quite  elegant.  Csanky  used  the  Cayley-Hamilton  theorem  to  reduce  the 
problem  of  matrix  inversion  to  essentially  the  problem  of  computing  O(n)  pro¬ 
ducts  of  nXn  matrices.  Let  M(n)  >  n2  be  the  number  of  processors  sufficient  in 
order  to  multiply  two  nXn  matrices  in  Q(log  n)  time.  By  the  upper  bounds  of 


Codaa 

i/or 
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(Chandra,  76),  M(n)  <  n281.  We  easily  extend  that  result  as  follows.  If  nXn 
matrices  can  be  multiplied  involving  0(n")  arithmetic  operations  for  some  u  >  2, 
then  M(n)  =  n",  see  Appendix  A.  The  current  best  upper  bound  on  u  is  2.495...; 
however,  for  matrices  of  smaller  sizes  we  should  count  only  on  M(n)=n3  due  to 
the  considerable  overhead-  of  the  known  asymptotically  fast  algorithms  for  matrix 
multiplication,  see  (Coppersmith  and  Winograd,  82)  and  compare  (Pan,  84). 
Csanky’s  algorithm  takes  simultaneous  time  0(log  n)2  and  nM(n)  processors. 
(Preparata  and  Sarwate,  78)  reduced  the  processor  bounds  to  Vn  M(n).  If  nonex¬ 
act  arithmetic  is  used,  as  this  occurs  in  practice,  both  of  these  methods  suffer 
from  numerical  instability  due  to  the  use  of  the  Cayley-Hamilton  theorem  (see 
discussion  in  (Wilkinson,  61,65)). 

Two  methods  for  parallel  matrix  inversion  over  arbitrary  fields  of  constants 
are  known.  (Borodin,  von  zur  Gathen,  and  Hopcroft,  82)  observe  that  the  results 
of  [Valiant,  Skyum,  Berkowitz,  and  Rackoff,  83)  and  (Strassen,  73)  combined  can 
be  used  to  parallelize  the  sequential  Gaussian  elimination  algorithm  for  matrix 
inversion.  This  yields  a  simultaneous  0(log  n)2  time  and  (very  large)  polynomial 
processor  algorithm  for  INVERT  over  any  field.  (As  a  matter  of  fact,  the 
straightforward  parallelization  of  the  Gaussian  elimination  only  yields  O(n)  time, 
n2  processor  algorithms  for  both  INVERT  and  LINEAR-SOLVE.)  [Berkowitz,  84) 
has  another  such  matrix  inversion  algorithm  over  arbitrary  fields.  Neither  of 
these  two  polylog  time  algorithms  can  be  viewed  as  practical  in  the  case  of  real, 
complex,  or  rational  inputs. 

1.2.  Our  Results  for  INVERT. 

We  have  two  main  results: 

(1)  a  parallel  iterative  method  for  INVERT  in  the  case  of  well-conditioned 

matrices,  and 


(2)  a  parallel  method  for  LINEAR-SOLVE  in  the  case  of  symmetric  positive 

definite  sparse  matrices. 

Both  algorithms  run  in  polylog  time  and  give  significant  decrease  of  the  known 
processor  bounds. 

The  iterative  methods  that  we  describe  in  Section  2  and  Appendix  C  are 
known  ones,  namely,  the  Newton  iteration  and  its  extensions.  The  first  use  of 
the  Newton  iteration  to  invert  a  matrix  is  attributed  by  Householder  to  [Schultz, 
33]  but  frequently  goes  by  the  name  of  [Hotelling,  43a,b]  and  [Bodewig,  59].  (See 
[Householder,  64],  [Isaacson  and  Keller,  66],  and  [Newman,  82]  for  more  recent 
descriptions  of  such  methods.)  [Bojariczyk,  84]  made  still  another  rediscovery  of 
Newton’s  method  and  proposed  it  for  parallel  computation. 

The  Newton  method  is  not  normally  recommended  for  matrix  inversion  in 
textbooks  on  numerical  analysis,  since  an  initial  approximate  inverse  B  of  a  given 
matrix  A  is  required  such  that  ]|I  -  BA||  is  substantially  less  than  I.  The  prob¬ 
lem  of  efficiently  (i.e.,  without  inverting  A)  finding  such  an  approximate  inverse 
of  a  well-conditioned  matrix  A  was  an  open  problem  in  numerical  analysis  for  the 
last  50  years  since  [Schultz,  33].  For  a  strictly  diagonally  dominant  matrix  A  an 
approximate  inverse  B  of  A  is  an  easily  computable  diagonal  matrix,  but  no 
known  techniques  give  such  B  for  a  general  matrix  A. 

We  observe  that  it  suffices  to  define  an  approximate  inverse  B  such  that 
||I  -  BA||  <  1  -  (l/n°  W).  Then,  since  the  Newton  method  is  quadratically  con¬ 
vergent,  0(log  n)  iterations  are  sufficient  for  numerical  computation  of  A-1  (up  to 
accuracy  2_n0<11).  Each  iteration  takes  only  0(log  n)  time  and  M(n)  processors. 
In  that  situation  we  managed  to  make  the  Newton  iteration  work  by  contributing 
a  method  for  computing  an  approximate  inverse  B  of  any  given  well-conditioned 
matrix  A  such  that  ||I  -  BA||  <  1  -  (l/n°W),  see  Lemma  2.4  in  Section  2  and  its 
substantiation  in  Section  3.  The  evaluation  of  such  B  involves  only  3n2-2n+l 


arithmetic  operations  and  2n-2  comparisons  that  can  be  implemented  using  0(log 
n)  parallel  steps  and  n2  processors,  see  Lemma  2.5  in  Section  2.  Our  resulting 
algorithm  computes  A-1  using  only  a  relatively  few  (only  0(log  n))  matrix  multi¬ 
plications,  that  is,  using  0((log  n)2)  time  and  M(n)  processors.  This  makes  the 
algorithm  efficient  for  both  sequential  and  parallel  computations.  Thus  we  have 
resolved  the  key  problem  that  previously  made  the  use  of  the  Newton  iteration 
impractical  for  computation  of  A'1.  Furthermore,  our  processor  bound  is  optimal 
because  matrix  multiplication  can  be  reduced  to  matrix  inversion,  see  [Borodin 
and  Munro,  75],  p.  51.  Our  parallel  and  sequential  time  bounds  are  optimal  o’ 
nearly  optimal  (up  to  within  the  factor  0(log  n)).  We  present  the  asymptotic 
complexity  estimates  but  the  reader  can  see  from  our  analysis  that  our  algorithms 
have  small  overhead,  so  they  are  efficient  and  practical  already  for  moderate  n. 

1.3.  Our  Results  for  LIN  EAR- SOLVE  in  the  Case  of  Sparse  Symmetric 
Positive  Definite  Systems. 

Our  another  contribution  is  a  method  for  LINEAR-SOLVE  for  sparse  sym¬ 
metric  positive  definite  linear  systems,  which  in  many  practical  cases  provides  a 
further  order  of  magnitude  improvement  in  processor  bounds  without  significant 
additional  time  cost.  Our  algorithm  drastically  differs  from  the  known  methods 
for  polylog  time  solution  of  linear  systems,  which  make  no  use  of  the  sparsity 
structure  of  the  matrix  A.  To  characterize  the  linear  systems  Ax=b  that  our 
algorithm  solve,  we  will  use  the  two  following  definitions. 

Definition  1.1.  Let  C  be  a  class  of  undirected  graphs  closed  under  the  sub¬ 
graph  relation,  that  is,  if  G  €  C  and  G'  is  a  subgraph  of  G,  then  G'  £  C.  The 
graphs  of  that  class  C  are  s(n)-separahle  if  there  exist  constants  n0  >  0  and 
a,  0  <  a  <  1,  such  that  for  each  graph  G  6  C  with  n  >  n0  vertices  there  is  a 
partition  V,V2,S  of  the  vertex  set  of  G  such  that 
|  Vt  (  <  on,  |  V2  |  <  on,  J  S  |  <  s(n),  and  G  has  no  edge  from  a  vertex  of  Vj 


to  V2  (hence  S  is  said  to  be  an  s(n)-separator  of  G  and  the  class  C  is  said  to  have 
the  $(n)-separator  property). 

Binary  trees  are  obviously  1-separable.  A  d-dimensional  grid  (of  a  uniform 
size  in  each  dimension)  is  n^/^-separable.  [Lipton  and  Tarjan,  79)  show  that 
the  planar  graphs  are  \/8n-separable  and  that  every  n-vertex  finite  element  graph 
with  <  k  boundary  vertices  in  every  element  is  4  |k/2j  Vn-separable. 

Definition  1.2.  Given  an  nXn  symmetric  matrix  A  =  (ajj),  we  define 
G(A)=(V,E)  to  be  the  undirected  graph  with  vertex  set  V={l,...,n}  and  edge  set 
E={{i,j}  |  a;j  0}.  A  is  sparse  if  |  E  |  =o(n2). 

The  very  large  linear  systems  Ax=b  that  arise  in  practice  often  are  sparse 
and  furthermore  have  graphs  G(A)  with  small  separators.  Important  examples  of 
such  systems  can  be  found  in  circuit  analysis  (e.g.,  in  the  analysis  of  the  electrical 
properties  of  a  VLSI  circuit),  in  structural  mechanics  (e.g.,  in  the  stress  analysis 
of  large  structures),  and  in  fluid  mechanics  (e.g.,  in  the  design  of  airplane  wings 
and  in  weather  prediction).  These  problems  require  the  solution  of  (nonlinear) 
partial  differential  equations,  which  are  then  closely  approximated  by  very  large 
linear  difference  equations  whose  graphs  are  generally  planar  graphs  or  3- 
dimensional  grids.  The  standard  weather  prediction  model  for  the  U.S.A.,  for 
example,  consists  of  a  3-dimensional  grid  with  a  very  large  number  n  of  grid 
points,  but  this  grid  has  only  a  constant  height  h  between  7  and  20,  and  hence  it 
is  at  most  Vim-separable. 

In  general,  the  inverse  of  a  sparse  matrix  A  (even  of  one  with  small  separa¬ 
tors)  is  dense.  Our  algorithm  for  LINEAR-SOL VE  avoids  computing  the  inverse 
matrix,  and  instead  computes  a  special  factorization  of  A.  For  sparse  matrices 
with  small  separators,  our  polylog  time  algorithm  yields  processor  bounds  that 
are  an  order  of  magnitude  lower  than  the  bounds  attained  by  other  polylog  time 
parallel  algorithms,  which  compute  the  inverse  matrix.  Specifically,  given  an 
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nXn  positive  definite  symmetric  matrix  A  such  that  G(A)  is  s(n)-separable  and 
s(n)  is  of  the  form  n*  for  a  constant  <7,  then  we  can  compute  the  special  recursive 
factorization  of  A  in  0((log  n)(log  s(n))2)  time  using  |  E  |  +  M(s(n))  processors. 
Then,  given  this  recursive  factorization,  the  solution  of  Ax=b  for  any  given  b 
requires  only  0(log  n  log  s(n))  time  and  |  E  |  4-  (s(n))2  processors. 

The  key  idea  is  to  use  a  nested  dissection  of  G(A)  in  order  to  reduce  the 
problem  to  inverting  dense  matrices  of  sizes  at  most  s(n)  X  s(n),  The  idea  of 
nested  dissection  is  well  known  for  sequential  methods.  It  was  first  proposed  by 
[George,  73]  for  grid  graphs  and  later  generalized  by  [Lipton,  Rose,  and  Tarjan, 
79]  to  graphs  with  small  separators.  We  are  the  first  to  apply  a  nested  dissection 
to  yield  a  parallel  algorithm  with  polylog  time  bounds.  The  extension  of  the  idea 
of  nested  dissection  from  the  sequential  case  to  the  parallel  case  was  not  immedi¬ 
ate  since  many  sets  of  separators  must  be  eliminated  at  each  parallel  step. 
Furthermore,  in  the  parallel  case  we  need  a  special  recursive  factorization  of  A, 
distinct  from  the  LDLT-factorization  used  in  sequential  nested  dissection. 

Let  us  comment  on  the  complexity  of  our  algorithm.  At  first  we  will  assume 
the  practical  bound  M(n)  =  n3  for  matrix  multiplication.  It  is  significant  that 
our  parallel  nested  dissection  algorithm  has  processor  bounds  that  are  substantial 
practical  improvements  over  the  previously  known  bounds. 

Let  G  be  a  fixed  planar  graph  such  that  0(v^n)-separators  are  known  for  G 
and  its  subgraphs.  (For  example,  G  might  be  a  v^n  X  Vn  grid  graph).  Then  for 
any  nXn  matrix  A  such  that  G  =  G(A)  our  parallel  nested  dissection  algorithm 
takes  0(log  n)3  time  and  n1  5  processors  to  compute  the  special  recursive  factori¬ 
zation  of  A,  and  then  0(log  n)2  time  and  n  processors  to  solve  any  linear  system 
Ax  =  b  with  A  fixed.  We  have  the  same  time  bounds  and  the  processor  bounds 
n1  5k3  and  nk2,  respectively,  if  G(A)  is  an  n-vertex  finite  element  graph  with  <  k 
vertices  on  the  boundary  of  each  face.  In  yet  another  example,  if  G(A)  is  a  3- 
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factored  as  LLT  where  L  is  a  linear  triangular  matrix,  see  [Wilkinson,  65],  [Golub 
and  van  Loan,  83].  Furthermore  such  a  decomposition  =  LLT  is  unique  if  L 
has  positive  diagonal  entries.  Then  (4.2)  and  (4.3)  imply  similar  decompositions 
for  the  matrices  Ah,Xh  for  all  h,  so  that  all  of  those  matrices  are  symmetric  posi¬ 
tive  definite.  Let  us  designate  Ah  =  LhLhT,  Xh  =  LhLhT.  Then  Ah_1  =  Lh~1(Lh",)T, 
Xh  =  Lh(Lh'l)T  and  (4.2)  and  (4.3)  imply  that 


Lk  = 


Lh  0 

YhXh-«  Lh+1 


Lw 1  = 


V*  -Xh‘YbT 
0  Lh"+i 


h=0,l,...,d-l.  Let  v  be  a  column-vector  such  that  |[v||  =  1,  ||Lhv||  =  ||Lh||,  see 
(2.1).  Then,  padding  V  with  zeros  at  the  bottom,  we  will  obtain  a  vector  v  such 


that  ||v||  =  1  and  ||Lhv||  >  ||Lhv||  =  jjlh|],  so  that  ||Lh||  >  ||Lh||.  Similarly 
II4II  >  l|LhtlH,  l|Lb-'||  >  ||Ld,||.  IILb-'ll  >  ||Lb+,||. 


By  Lemma  3.4,  all  of  those  four  bounds  combined  imply  that 


IIAhll  >  l|Xb||,  ||Ab||  >  ||Ah+1||, 

IIAk'H  >  l|X6- 'll,  ||Ab >||  >  ||Ari,||  for  all  h=0,l,...,d-l. 

Surely  cond  A  =  cond  Ao,  so  these  inequalities  immediately  imply  Lemma  4.4. 

Q.E.D. 

Remark  4.1.  The  algorithm  of  this  section  can  be  extended  to  the  cases 
where  A  is  an  arbitrary  matrix  with  small  separators  and  such  that  all  “inter¬ 
mediate”  submatrices  Xb  are  well-conditioned.  The  latter  property,  however, 

may  not  hold  even  for  some  symmetric  matrices  A  such  as  j  where  |  e  |  is 

small. 


Lemma  4.2(b)  implies 

Lemma  4.6.  The  matrix  product  Y^X^Yj,7  can  he  decomposed  into  Nh  pro¬ 
ducts  of  pairs  of  matrices  of  sizes  at  most  nh  ^  X  nhk  for  k=l,...,Nh,  where 
nh.k  —  I  Shtk  I  •  These  products  can  be  computed  in  time  0(log  s(n))2  using  a  total 


9 


9 


9 


-21- 


some  0  <  l  <  h— 1,  unless  j*  =  j**  (we  will  omit  the  latter  trivial  case).  Thus 
there  exists  a  path  =  jj ,}2} A'te’h) ,  •  •  •  »  =  j**+4}  in  Eh  visiting 

only  vertices  in  Rh;  but  the  induction  hypothesis  (a)  implies  that  there 

can  be  no  edge  in  Ej,  between  Rk  k  and  Rh  k.  for  kj^k*,  so  G  Rh  k  for  some 

unique  k.  Thus  we  have  established  the  case  (b)  of  the  lemma  for  Eh+J. 

Next  suppose  the  case  (a)  of  Lemma  4.2  does  not  hold  for  h+1;  so  there 
exists  a  path  p  in  Gh+J  between  some  i  £  Vh.k  and  j  €  Rh*,k  where  h*  >  h+1  but 
p  visits  no  vertex  v  with  jt(v)  >  6b«+1.  Case  (b)  for  h  implies  that 
{i,j }  6  Eh+1  -  Eh  only  if  there  is  a  path  in  Gh  between  i  and  j  containing  only 
vertices  j|,...,jj  €  Rh  with  x(jr)  <  6h+1,  f°r  Thus  we  can  construct  from 

p  a  path  p'  in  Gh  between  i  and  j  that  visits  no  vertex  v  with  jt(v)  >  $h*+i, 
violating  the  induction  hypothesis  for  (a),  a  contradiction.  Thus  we  have  shown 
that  (a)  holds  for  Gh+1,  establishing  the  induction  hypothesis  for  h+1.  Q.E.D. 

Lemma  4.2(a)  implies  that  Eh  contains  no  edge  between  Rh  k  and  Rh  k.  for 
k  jd  k*.  Since  maxk  (  Rh  k  |  <  maxk  s(  |  Vh  k  |  )  <  s(ad~hn),  we  immediately 
arrive  at 

Lemma  4.3.  Xh  »'«  a  block-diagonal  matrix  consisting  of  Nh  <  2d~h  square 
blocks,  where  each  block  is  of  size  at  most  s(adhn)  X  s(ad'hn). 

Lemma  4.3  implies  that  the  inversion  of  Xh~ 1  can  be  reduced  to  Nh  <  2d~h 
parallel  inversions  of  dense  matrices  (i.e.,  one  dense  matrix  is  associated  with 
each  Rh,k),  each  of  size  at  most  sfor^n)  X  s(o®  hn).  By  Corollary  2.1,  this  can 
be  done  in  0(log  s(n))2  time,  NhM(s(ad_hn))  <  2d_hM(s(ad_hn))  processors.  By 
our  assumptions,  aw<r  <  1/2,  so  this  processor  bound  is  0(M(s(n))).  In  fact,  we 
need  to  prove  the  following  lemma  in  order  to  be  able  to  apply  Corollary  2.1. 

Lemma  4.4.  cond  Xh  <  cond  A  for  all  h. 

Proof.  A|,  =  PAPT  is  a  symmetric  positive  definite  matrix,  so  it  can  be 


entry  at  the  i-th  row  and  the  j-th  column  of  Aj  is  nonzero  provided  that  here  the 
rows  are  counted  bottom-up  and  the  columns  from  right  to  left. 

The  fill-in  at  stage  h  is  the  set  of  edges  that  are  in  Eh  but  not  in  E^.  The 
following  technical  lemma  upper  bounds  this  fill-in  and  also  provides  some  useful 
information  about  the  connectivity  structure  of  Gh. 

Lemma  4.2.  Let  h  >  0.  Then 

a)  if  p  is  a  path  in  Gh  between  i  £  Vh.  k  and  j  E  Rh'.k  for  h*  >  h  and  some 

k,  then  p  visits  some  vertex  v  such  that  jt(v)  >  6h«+1,  Mof  is,  v  g  Rq  for 

q  <  h*; 

b)  Eh+1C  {{i,j}EEh  |i,j£Rh} 

U { { i , j }  I  3k  3  {i,ii},{j1J2}»  •  •  •  {i/j}  GEh} 

where  j„  .  .  .  ,  jj  E  Rh.k  and  *0)  >  *{i)  >  $h}- 

Proof  (by  induction  on  h).  We  first  consider  the  case  (a)  with  h=0. 
Observe  that  if  there  is  a  path  p  in  G0  between  some  i  g  Vh.k  and  j  E  Rh-k  for 
any  h*  >  0,  then  p  must  contain  a  vertex,  say  j*,  of  the  separator  set  Sh.+j  k., 
where  (Vh-+1  k*,Sh-+i k-)  is  the  parent  node  of  (Vk.k,Sk«k)  in  Tq.  This  vertex  j*  has 
number  7r(j*)  >  as  required. 

Suppose  that  Lemma  4.2(a)  holds  for  some  fixed  h  and  for  all  h*  >  h.  By 
definition,  Ah  =  Zh  -  YhX„->YhT.  Let  Xh  =  (x^),  Xb'  =  3^  =  (Sty),  Yh  =  (y,;), 
zh  =  (tij).  and  wh  =  YhXh-'YhT  =  (wjj).  If  {i,j}  g  Eh+1,  then  i,j  g  Rh,  by  the 
definition  of  Rh,  so  jrfi)  >  x(j)  >  6j,+j.  Furthermore  we  have  Zj-^j-^  ^  0  or 
otherwise  w^.^  jk  0.  If  7*  °-  then  {i,j}  E  Eh  as  claimed.  On  the  other 

hand,  if  w,,^.^  jk  o,  then  3  j’+^j*  *+^  E  Rh  such  that  y^.  jk  0,  5^--.  0 

and  y-..  jk  0.  It  follows  that  {i,j*+5b},{j,j*  *+^,}  €  Eh-  Furthermore  the  Cayley- 
Hamilton  Theorem  gives  Xk  1  =  CqI  +  qXj,  +...+  Ch_iXh '*  for  some  scalars 
c0,  C|,  .  .  .  ,  cn.,.  Hence  x^..  jk  0  implies  that  the  (j*,j**)  entry  of  Xk  is  not  0,  for 


d 

Observe  that,  by  definition,  Rb  n  Rj,»  =  ^  if  h  5^  h*  and  that  V  =  U  Rb. 

b«0 

Let  t.  {l,...,n}  -»  {l,...,n}  be  any  enumeration  of  the  n  vertices  of  G  such 
that,  if  v  €  Rh,  v*  £  Rh.  for  h*  >  h,  then  x(v)  <  jr(v*).  Thus  the  elements  of 
jr(Rh)  are  in  the  range  from  4+1.->4+i  w^ere  4  =  £g<h  I  Rg  I  •  Such  an 
enumeration  can  be  easily  computed  in  total  time  0(log  n)2  using  n/log  n  proces¬ 
sors  by  first  numbering  the  vertices  of  Rd  =  Sj  by  n,  n-1,...  and  then  numbering 
(also  in  the  decreasing  order)  all  previously  unnumbered  vertices  of  Rb  of  height  h 
for  each  h=d-l,d-2,...,0. 

We  now  define  the  permutation  matrix  P  such  that  P»  =  1  if  j  =  ?r(i)  and 
P~  =  0  otherwise.  Then  we  define  the  initial  matrix  A<>  =  PAPT.  Recursively, 
for  h=0,l,...,d-l, 


is  the  (n~4)  X  (n~4)  symmetric  matrix  where  Xh  is  the  |  Rh  |  X  |  Rh  |  upper 
left  submatrix  of  Ah,Yh  is  the  (n~4"  |  Rh  | )  X  |  Rh  I  )  lower  left  submatrix  of 
Ah,  and  Zh  is  the  (n-^-  |  Rh  | )  X  (n~4~  I  Rh  I )  lower  right  submatrix  of  Ah.  We 
then  define  Ah+l  =  Zh  -  YhXh_1YhT.  Thus  at  stage  h  we  have  eliminated  the  ele¬ 
ments  associated  with  Rh.  Note  that  Ah+J  is  symmetric  positive  definite  if  Ah  is, 
compare  the  proof  of  Lemma  4.4  below.  We  now  claim  that  we  can  compute 
Ah+,  from  Xh,  Zh  and  Yh  in  time  0(log  s(n))2  using  at  most  M(s(n))  processors. 
To  prove  this,  we  must  investigate  the  sparsity  structure  of  the  latter  subma¬ 
trices  of  Ah. 

Let  Ah  =  (a(jh)).  We  define  an  associated  graph  Gh  =  (Vh,Eb)  with  vertex 
set  Vh  =  {4+1,  4+2, ...,n)  and  edge  set  Eh  =  {{i+4>  j+4)  I  fti{h)  5^  °}:  that  i*. 
Gh  is  derived  from  G(Ah)  by  adding  4  to  each  vertex  number,  see  Figures  1,2, 3, 4 
and  5  below.  This  enumeration  implies  that  {n-i,n-j}  £  En  if  and  only  if  the 


Efficient  parallel  computation  of  s(n)-separators  is  not  simple  in  general  but 
it  is  rather  straightforward  in  the  practically  important  cases  of  grid  graphs  (see 
illustrative  Figures  1,2, 3, 4, 5  below);  similarly  such  computation  is  simple  for 
many  finite  element  graphs,  compare  [George, 73]. 

To  prove  Theorem  4.1,  we  will  begin  with  defining  a  binary  tree  Tq  for  each 
graph  G  £  C,  see  Figure  6.  Suppose  G=(V,E)  has  n  vertices.  If  n  <  n0,  where 
n0  is  a  given  constant,  (see  Definition  1.1),  we  let  TQ  be  the  trivial  tree  with  no 
edges  and  with  the  single  leaf  (V,S)  where  S=V.  Else,  if  n  >  n0,  we  recall  that 
we  know  an  s(n)-separator  S  of  G  so  that  we  can  find  a  partition  Vj,  V2,  S  of  V 
such  that  there  is  no  edge  in  E  between  the  sets  Vj  and  v2,  and  furthermore 
|  Vj  |  <  an,  |  V2  |  <  an,  and  |  S  |  <  s(n).  Then  Tq  is  defined  to  be  the 
binary  tree  with  the  root  (V,S)  having  exactly  two  children  that  are  the  roots  of 
the  two  subtrees  TGl,  Tq2  of  TG  whert  Gh  is  the  subgraph  of  G  induced  by  the 
vertex  set  S  U  Vh  for  h=l,2. 

Let  the  height  of  a  node  v  in  TG  equal  d  minus  the  length  of  the  path  from 
the  root  to  v  where  d,  the  height  of  the  root,  is  the  maximum  length  of  a  path 
from  the  root  to  a  leaf.  Let  Nh  be  the  number  of  nodes  of  height  h  in  TG.  Since 
Tg  is  a  binary  tree,  Nh  <  2d'h.  Let  (Vh ,,  Sh j),  ...,  (Vh  NN,Sh  nJ  be  a  list  of  these 

Nfc 

nodes  of  height  h  and  let  Sh  =  U  Sh  k.  Observe  that  by  the  s(n)-separable  pro- 

k—  1 

perty  of  C,  |  Vh  k  |  <  ad~hn  and  |  Sh  k  |  <  s(adhn)  for  each  h  >  0  and 
k  =  l,...,Nh.  Thus  d  <  c*log  n,  c*  =  l/log(l/a)  for  adn  <  n0,  |  Vok  |  <  n0  and 
S0  k  =  Vo  k  for  all  k,  by  the  definition  of  the  tree  TG. 

For  each  k  =  l,...,Nk  such  that  Sk  k  ^  Vk  k,  let  Rk  k  denote  the  set  of  all 
elements  of  Sh  k  that  are  not  in  Sh.  for  h*  >  h.  (Actually  Rk  k  =  Sk  k  -  U  Sk»k. 

where  the  union  is  over  all  ancestors  (Vh.k.,Sh«k«)  of  (Vhk,Sj^k).)  The  s(n)- 

Nk 

separable  property  implies  that  Rh  k(  n  Rh,ka  =  ^  if  k2.  Let  Rh  =  Rb,k- 


symmetric  positive  definite  matrix  A  such  that  cond  A  <  n0^  and  G=G(A),  we 
can  numerically  compute  a  recursive  s(n)-factorization  in  time  0(log  n(log  s(n))2) 
using  |  E  J  4-  M(s(n))  +  n/log  a  processors  (where  M(s(n))  is  the  number  of  pro - 
cessors  sufficient  in  order  to  multiply  two  s(n)  X  s(n)  matrices  in  time 
0(log  s(n))).  Furthermore  that  recursive  s(n)-factorization  satisfies  Lemmas  4-5 
and  4-6  below.  Whenever  such  a  recursive  s(n)-factorization  of  A  i«  available, 
Oflog  n  log  s(n))  time  and  |  E  |  +  s(n)2  processors  suffice  to  solve  a  system  of 
linear  equations  Ax=b  for  any  given  column  vector  b  of  length  n. 

Corollary  4.1.  Fix  an  n-vertex  planar  graph  G  such  that  0(Vn )-scparators 
for  G  and  for  its  subgraphs  have  been  precomputed.  If  A  is  an  nXn  symmetric 
positive  definite  matrix  with  G=G(A)  and  if  cond  A  <  n0^),  then  we  can  numer¬ 
ically  compute  a  recursive  0(\/n)-factorization  of  A  in  time  0(log  n)3  using 
M(v/n)  <  n1'25  processors.  Whenever  such  a  recursive  factorization  is  available, 
0(log  n)2  time  and  n  processors  suffice  to  solve  Ax=b  for  a  given  b. 

Corollary  4.2.  Fix  an  n-vertex  finite  element  graph  G  with  at  most  k  ver¬ 
tices  on  the  boundary  of  every  face.  Let  0(k\/n )-separators  for  G  and  for  its  sub¬ 
graphs  have  been  precomputed.  If  A  is  an  nXn  symmetric  positive  definite 
matrix  with  G=G(A)  and  if  cond  A+<  n°^%  then  we  can  numerically  compute  a 
recursive  O(k\/o)-factorization  of  A  in  time  0(log  n)3  using  M(k\/n)  <  k25n125 
processors.  Whenever  such  a  recursive  factorization  is  available,  0(log  n)2  time 
and  nk2  processors  suffice  to  solve  Ax—b  for  a  given  b. 

Corollary  4.3.  If  A  is  an  nXn  symmetric  positive  definite  matrix  with  a 
d-dimensional  grid  graph  (for  any  d  >  2)  and  if  cond  a  <  n°  then  we  can 
numerically  compute  a  recursive  (n ^-factorization  of  A  in  time  0(log  n)3 
using  <  n25_*25^d*  processors.  Whenever  sueh  a  recursive  factoriza¬ 

tion  is  available,  0(!og  n)2  time  and  n2~*2/,d'  processors  suffice  to  solve  Ax=b  for  a 
given  b. 


and  Xh  is  a  block-diagonal  matrix  consisting  of  square  blocks  of  sizes  at  most 
s(ad_hn)  X  s(od_hn)  where  n0  >  adn,  n0  is  a  constant.  (Here  and  hereafter  WT 
denotes  the  transpose  of  a  matrix  W.)  Such  a  recursive  s(n)- factorization  has 
length  at  most  d  <  0(log  n)  (whereas  a  customary  LDLT  factorization  has  length 
n).  This  recursive  factorization  is  said  to  be  numerically  computed  if  the  com¬ 
puted  approximants  of  the  induced  matrices  A],  ...,  Ad  satisfy  (4.1)  within  error 
norm  2'“*,  for  a  positive  constant  c. 

Observe  that,  by  definition  of  a  recursive  s(n)-factorization,  we  have  for 
h=0,...,d-l  the  identity 


I  0  Xk  0  I  Xfc-'Yj 

Ah=  YbV  I  0  Ak+I  o  I 

and  hence 


(4.2) 


A. 1  = 


I  -X„  'Y?  Xk-'  0  I  0 

0  I  0  Ail,  -W  1 


Thus  given  a  recursive  s(n)- factorization  (4.1),  it  is  easy  to  recursively  com¬ 
pute  A"*b  for  any  column  vector  b  of  length  n. 


Lemma  4.1.  For  all  constants  o,  a*,c  such  that  1/2  <  a*  <  a  <  1,  if 
s(n)  <  cn *  and  if  C  is  a  class  of  s(n)-separable  gTaphs  with  respect  to  a,  then  C 
is  s*(n)-separable  with  respect  to  a*,  where  s*(n)  <  cn<T/(  1— 


Lemma  4.1  can  be  proven  following  [Lipton  and  Tarjan,  79],  see  the  end  of 
this  section.  Without  loss  of  generality,  in  the  following  we  will  assume  that 
a  =  1/2  (see  Lemma  4.1),  that  M(n)  =  nw  and  that  s(n)  =  0(0*)  for  fixed  con¬ 
stants  u>  and  such  that  0  <  a  <  1,  and  2  <  w  <  3.  Furthermore  we  will 
assume  that  a  >  1/w  such  that  a  <  1/2"*  (see  Lemma  4.1),  so  a"*  <  1/2. 


Theorem  4.1.  //  C  is  a  class  of  s(n)-separa6fe  graphs  and  if  the  s(n)- 
separators  of  a  graph  G  G  C  and  of  its  subgraphs  are  known ,  then  given  on  nXn 
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Lemma  S.5.  Let  B  and  t  be  defined  by  (2.6).  Let  p  be  an  eigenvalue  of 
R(B)  =s  I  -  BA.  ThenO  <  p  <  1  -  l/((cond  A)2n). 

Proof.  Let  R(B)v  =  pv  for  v  ^  0.  Then  (I-tAHA)v  =  v  -  tAHAv  =  pv. 
Therefore  AHAv  =  Xv  for  X  =  ( 1— p)/t  so  X  is  an  eigenvalue  of  AHA.  Corollary 
3.2  implies  that  1/||A_1||2  <  X  =  (1  -  p)/t  <  ||A||2.  It  immediately  follows  that 

l-t||A||2  <l*<  l-t/IIA-'f.  (3.1) 

It  remains  to  recall  (2.6)  and  to  apply  the  inequalities  of  Lemma  3.3.  Q.E.D. 

Since  p  is  an  arbitrary  eigenvalue  of  R(B),  p(R(B))  <  1  -  I/((cond  A)2n). 
On  the  other  hand,  ||R(B)||  =  p(R(B))  since  R(B)  =  I-tAHA  is  Hermitian.  This 
completes  the  proof  of  Lemma  2.4.  Q.E.D. 

4.  Parallel  Generalized  Nested  Dissection 

In  this  section,  we  fix  a  class  C  of  undirected  graphs,  which  are  s(n)- 
I  separable  (with  respect  to  constants  n0  and  a),  see  Definition  1.1.  Let  A  be  an 

nXn  symmetric  positive  definite  matrix  with  graph  G=G(A)  in  class  C  (see 
Definitions  3.1  and  1.2).  Even  if  A  is  sparse,  and  if  G(A)  has  small  separators, 
i  A-1  may  not  be  sparse  (and  in  fact  if  G(A)  is  connected,  A-1  may  have  no  zero 

entries).  We  will  describe  an  efficient  parallel  algorithm  that  computes  a  special 
recursive  factorization  of  A.  With  that  factorization  available,  it  will  become 
very  easy  to  solve  the  system  of  linear  equations  of  the  form  Ax=b  for  any  given 
vector  b. 

Definition  4.1.  A  recursive  s(n ) -factorization  of  a  matrix  A  with  respect  to 
a,  0<o<l,  is  a  sequence  of  matrices  Ao,  Aj, ...,  Aj  such  that  A(,  =  PAPT,  P  is 
an  nXn  permutation  matrix  and  for  h=0,l . d-1, 

Ah  =  y  >  Zb  =  Ah+1 4-  YhXh-’YhT,  (4.1) 

n  d 


symmetric  positive  definite.  (VVHW  is  Hermitian  since 
(VVHW)H  =  W”(WH)H  =  WHW,  compare  Definition  2.2.) 

We  will  use  the  3  following  well  known  results,  see  [Atkinson,  78],  pp.  418- 
431,  or  [Wilkinson,  65].  (Recall  that  ||W||  denotes  the  2-norm  of  a  matrix  W.) 

Lemma  3.1.  ||W||  =  ||WH||  for  all  W;  ||W||  =  p(W)  if  W  =  WH. 

Lemma  3.2.  All  eigenvalues  of  a  Hermitian  positive  semidefinite  matrix  are 
nonnegative. 

Lemma  3.3.  Let  W  =  (wy).  Then  ||WHW|j  =  p(WHW)  =  ||W||2  < 

ll'VWH,  <  max  £  I  »ij  I  max  £  |  w,j  |  <  „||WHW||. 

1  i  1  i 

Applying  Lemma  3.3  to  W  =  AH  we  obtain  that  ||AH||2  <  1/t  where  t  is 

defined  by  (2.6).  Therefore,  by  Lemma  3.1,  ||AH||  <  l/(t||A||).  Taking  into 

account  that  ||A||  >  max  |  |  we  derive  the  first  2  inequalities  of  Lemma  2.4. 

i.j 

Corollary  3.1.  Let  A  =  (a^),  B  be  defined  by  (2.6).  Then 
l|B||  <  1/  1 1  A|  |  <  1/max  |  ai}  | . 

It  remains  to  prove  the  last  inequality  of  Lemma  2.4. 

Lemma  3.4.  Let  X  be  an  eigenvalue  of  a  nonsingular  matrix  W.  Then  1/X 
is  an  eigenvalue  of  W'1. 

Proof.  Surely  X  ^  0  for  nonsingular  W.  Let  Wv  =  Xv  for  a  vector  v  ^  0. 
Then  Wv  ^  0  and  W~’(Wv)  =  v  =  (l/X)Xv  =  (l/X)Wv.  Q.E.D. 

Corollary  3.2.  Let  X  be  an  eigenvalue  of  AHA  and  A  be  nonsingular.  Then 
l/IIA-'H2  <  X  <  ||A||2. 

Proof.  X  <  p(AHA)  =  ||A||2  by  Definition  3.1  and  Lemma  3.3.  On  the 
other  hand,  (AHA)_1  =  A_,(A_1)H,  so  (AHA)_1  is  a  Hermitian  positive  definite 
matrix.  By  the  virtue  of  Lemma  3.4,  1/X  is  an  eigenvalue  of  (AHA)-1.  Conse¬ 
quently  1/X  <  p((AHA)-‘)  =  pfA-'fA-1)”)  =  ||A-‘||2.  Q.E.D. 
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respectively,  compare  (2.1).  Let  (cond  W)#  =  ||W||8|jW-,||#  */ W  »*  nonsingular, 
s  =  oo  or  s=l. 

Remark  2.1.  Using  Definition  2.3  we  may  rewrite  (2.6)  as 
B  —  tAH,  t  —  1/(I|A||„  ||A||,). 

Substituting  this  into  (3.1)  and  using  Lemma  3.3  of  the  next  section,  we  deduce 
that 

l|R|B)||  <  l-l/fllAIUIAII.IIA-'f)  <  (2.9) 

l-l/((cond  A)^  (cond  A),). 

Furthermore  we  may  choose 

B  =  tA»,t  =  1/||A"A||,  >  1/(||Ah||,(|A|||)  =  l/(  |  A|U|A||,) 
and  deduce  from  (3.1)  and  Lemma  3.3  that,  for  this  choice  of  B, 

l|R(B)N  <  1-1/(I|AhA||,||A-,||*),  (2.10) 

which  is  a  further  small  improvement  over  (2.8)  and  (2.9).  (2.9)  and  (2.10)  lead 
to  the  respective  improvements  of  Corollary  2.2. 

Remark  2.2.  Lemma  2.4  can  be  proven  for  any  (nonsingular)  matrix  A. 
For  certain  classes  of  matrices  A  there  exist  other  options  for  choosing  an  approx¬ 
imate  inverse  B.  In  Appendix  B  we  indicate  such  options  for  some  important 
classes  of  matrices  including  Hermitian  (real  symmetric)  positive  definite,  diago¬ 
nally  dominant  and  triangular  matrices. 

3.  Proof  of  Lemma  2.4. 

Definition  3.1,  p( W),  the  spectral  radius  oj  a  matrix  W,  is  the  maximum 
magnitude  of  the  eigenvalues  of  W. 

Definition  3.2.  A  matrix  that  can  be  represented  as  WHW  for  some  matrix 
W  is  called  a  Hermitian  positive  semidefinite  matrix  (or  a  Hermitian  nonnegative 
definite  matrix).  A  nonsingular  Hermitian  positive  semidefinite  matrix  is  called 
Hermitian  positive  definite.  A  real  Hermitian  positive  definite  matrix  is  called 


(2.8) 


l|B||  <  1/llAH  <  1/max  |  ajj  | ,  ||R(B)||  <  l-l/((cond  A)2n). 

We  also  immediately  verify  the  following  estimate. 

Lemma  2.5.  Computing  B  by  (2.6)  costs  only  3n2-2n+l  arithmetic  opera • 
tions  and  2n-2  comparisons  that  can  be  performed  in  O(log  n)  time  using  n2  pro¬ 
cessors. 

Combining  Theorem  2.1,  Lemmas  2.4  and  2.5,  and  the  inequality  of  (2.7)  we 
get 

Corollary  2.1.  Let  A  =  (ajj),  c  be  arbitrary  constant,  cond  A  <  n0^. 
Then  0(log2n)  time,  M(n)  processors  suffice  to  compute  a  matrix  A'1  such  that 

1 1 A-1  -  A'1!!  <  2~“*  /  | |A| |  <  ||A-,||2-ne  /  cond  A  <  HA'1^-0'. 

Although  Corollary  2.1  covers  all  instances  A  of  practical  interest,  we  will 

also  state  the  following  immediate  generalization  of  that  corollary,  compare  (2.8) 
and  Theorem  2.1. 

Corollary  2.2.  For  any  number  k  and  for  any  nonsingular  nXn  matrix  A, 
it  is  sufficient  to  use  0(k  log  n)  parallel  steps  and  M(n)  processors  in  order  to 
compute  a  matrix  A'1  such  that 

l|A-‘||  -  HA-Ml  <  (l-l/(n(cond  A)2))2k  /  ||A||  < 
||A'I||(l-l/(n(cond  A)2))2Vcond  A  <  ||A-1|I(l-l/(n(cond  A)2))2\ 

In  particular  if  cond  A  <  C,  then  the  precision  ||A||  •  ||A_1  -  A_t||  <  e  <  1 
can  be  assured  using  0(log  n  (1+  |  log(nC2  log(l/e)  | ))  parallel  steps  and  M(n) 
processors,  if  C  and  e  are  arbitrary  positive  constants,  e  <  1. 

In  the  next  remark  and  in  Appendix  B  we  will  use  the  following  definition, 
compare  (Atkinson,  78],  [Golub  and  van  Loan,  83],  [Wilkinson,  65]. 

Definition  2.3.  UWH^  =  max  £  |  w,:  | ,  ||W|!j  =  max  £  |  w;:  |  are  the 

1  i  1  i 

operator  norms  of  a  matrix  W  =  |  w;j  ]  associated  with  the  maximum  norm 
1 1 v  1 1  oo  =  max;  |  V;  |  and  with  the  1-norm  ||v||t  =  J];  |  Vj  |  of  a  vector  v  =  (vj), 
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binary  numbers  of  the  form  a2-k  where  a  and  k  are  integers, 
|  a  |  <  2m,  0  <  k  <  m,  m  =  n0^. 

Lemma  2.3  immediately  implies  the  following 

Theorem  2.1.  Let  (2.2)  and  (2.4)  hold  and  let  c  be  a  constant.  Then 
0(log2n)  time  and  simultaneously  M(n)  processors  suffice  in  order  to  compute  a 
matrix  A-1  satisfying  (2.5). 

Here  (and  frequently  hereafter)  we  exploit  the  possibility  to  reduce  the 
number  of  processors  by  a  constant  factor  k  by  the  price  of  slowing  down  the 
computation  k  times. 

It  remains  to  choose  B  satisfying  (2.4)  for  a  nonsingular  A.  Hereafter  in  all 
expressions  E^Ej.max^maXj.maX;  j  the  integer  parameters  i  and  j  range  from  1  to 
n.  Let  us  specify  the  vector  norm  to  be  the  Euclidean  norm, 
||v||  =  (£i  |  Vj  |  2)1/,2)  and  let  us  extend  that  vector  norm  to  the  matrix  norm  by 
(2.1).  The  resulting  matrix  norm  is  called  the  2-norm.  Let  us  choose 

B  =  tAH,  t  =  l/(max  £  |  a  -y  |  max  £  |  au  | ).  (2.6) 

*  j  '  i 

Definition  2.2.  Here  and  hereafter  WH  designates  the  Hermitian  transpose 
of  a  matrix  W  =  (wy).  WH  =  (wj*),  Wj*  being  the  complex  conjugate  of  w^.  If 
\VH  =  W,  W  is  called  a  Hermitian  matrix.  If  W  is  real,  then  WH  is  the  transpose 
of  W  (which  we  will  designate  WT)  and  Hermitian  W  means  symmetric  W,  such 
that  wy  =  Wjj  for  all  i,j. 

cond  W  =  ||W||  *  ||W_1] |  >  ||I||  sss  1  if  W  is  nonsingular,  (2.7) 

cond  W  =  oo  otherwise. 

We  will  prove  the  next  lemma  in  the  next  section,  compare  also  Remark  2.1 
below  and  Remark  B.l  in  Appendix  B. 

Lemma  2.4.  Let  A  =  (ay)  be  nonsingular,  let  B  be  defined  by  (2.6)  and 
R(B)  be  defined  by  (2.2).  Then 


R(B)  =  I  -  BA,  ||R(B)||  =  q  <  1.  (2.2) 

(Actually  (2.2)  implies  that  A  and  B  are  nonsingular,  see  [Atkinson,  78],  p.  465.) 
Note  that  ||(R(B)),||  <q'-*0asi-*ooif  (2.2)  holds. 

Lemma  2.1.  Let  ||(R(B))'||  -+0  as  i  -*■  oo.  Then  A-1  =  (R(B))‘B. 

i-o 

Q.E.D. 

Proof.  A"1  =  (A  'B-'JB  =  (BA)-‘B  =  (I-RfB))'^  =  §  (R(B));B.  Q.E.D. 

i-0 

Lemma  2  l  immediately  implies 

Lemma  2.2.  Let  Bk*  =  ( (I+(R(B))2k))B.  Then  A"l-Bk  =  £  (R(B))'B. 

h-0  i— 2* 

The  relations  (2.2)  and  Lemma  2.2  imply  that 

IIA-*  -  B„-||  <  g  q‘||B||  =  q2*  ||B||  /  (1-q).  (2.3) 

i-2* 

The  first  algorithm  for  INVERT  successively  computes  (R(B))2h  and  Bh*  for 
h=0,l,...,k-l.  Recall  that,  by  the  virtue  of  (2.2),  | j(R(B))*|(  <  q1,  so  the  computa¬ 
tion  by  that  algorithm  is  stable.  In  Appendix  C  we  will  also  consider  two  other 
algorithms  that  converge  to  A'1  with  the  same  speed.  Those  algorithms  are  supe¬ 
rior  over  the  first  algorithm  for  they  are  not  only  stable  but  also  self-correcting. 

(2.3)  implies 

Lemma  2.3.  Let  (2.2)  hold  and 

q  =  1  -  l/n0^  as  n  — *  oo  .  (2.4) 

Let  c  be  a  constant.  Then  0(log  n)  iterations  of  the  algorithm  of  this  section 
suffice  in  order  to  compute  a  matrix  A'1  such  that 

||A->  -  A-'||  <  2-”'  ||B||.  (2.5) 

The  latter  precision  suffices  for  all  practical  purposes  and  cannot  be  reduced 
further  if  |  (log  ||B||)  |  <  n°  W  and  if  the  entries  of  A-1  are  to  be  represented  by 


dimensional  grid,  so  is  n2/3-separable,  then  we  have  the  same  time  bounds  as  for 
planar  and  finite  element  graphs,  and  our  processor  bounds  are  n2  and  n1-33, 
respectively.  Furthermore,  if  we  use  more  theoretical  bounds  for  matrix  multipli¬ 
cation,  say  M(n)  =  n23,  then  our  processor  bounds,  for  computing  the  special 
recursive  factorization  are  further  decreased  to  n1 25  in  planar  case,  to  n125k2-5  in 
the  case  of  n-vertex  finite  element  graphs  with  <  k  vertices  per  face  and  to  n1 87 
for  the  3-dimensional  grid. 

1.4.  Contents 

In  Section  2  we  present  our  iterative  parallel  algorithm  for  INVERT  and  its 
complexity  estimates.  In  Section  3  we  prove  the  main  lemma  of  Section  2.  In 
Section  4  we  present  our  parallel  algorithm  for  LINEAR-SOLVE  in  the  case  of 
sparse  symmetric  positive  definite  systems  and  bound  its  complexity.  In  Appen¬ 
dix  A  we  bound  the  asymptotic  parallel  complexity  of  matrix  multiplication.  In 
Appendix  B  we  consider  the  simplified  methods  for  defining  an  approximate 
inverses  of  some  special  matrices.  In  Appendix  C  we  consider  two  alternatives  to 
the  iterative  algorithm  of  Section  2.  In  Appendix  D  we  analyze  the  errors  of  our 
algorithms. 

2.  Iterative  Methods  for  Parallel  Matrix  Inversion 

Definition  2.1.  Let  a  vector  norm  be  fixed.  Then  we  extend  it  to  the  asso¬ 
ciated  operator  norm  of  matrices  so  that 

l|W||  =  max  l|Wv||  /  ||v||  (2.1) 

for  all  nonzero  vectors  v  and  for  all  matrices  W. 

In  the  following  n°  ^  denotes  the  values  bounded  by  a  polynomial  in  n  as 
n  — ♦  oo  ,  A  =  (ajj)  is  a  fixed  nXn  nonsingular  complex  matrix  (so  A'1  exists),  I 
denotes  the  nXn  identity  matrix,  R(X)  denotes  I  -  XA,  and  B  is  called  an 
approximate  inverse  of  A  if 


N„ 

of  S  M(nh,k)  processor*. 
k-l 

To  estimate  this  number  of  processors,  we  need 

Nfc 

Lemma  4.8.  For  any  7  >  1,  £  n£k  =  0(s(n)'If). 

k-l 

Proof.  Observe  that,  by  the  definition  of  the  tree  TG,  if  the  nodes 
(Vh.i,kl,Sh.1|kl)  and  (Vh_1(ka,Sh_1)kj)  are  the  children  of  the  node  (Vhk,Shk),  then 

Qh-i.k,  +  nh-i,ka  <  Qh,k  +  I  sh>k  I  <  Dh,k  +  s(ad'hn)  since  |  Sh  k  |  <  s(ad-hn). 

Nk 

This  recursive  inequality  implies  that  £  nh\  i8  maximized  when  only  one  term 

k-l 

Nk  d 

is  nonzero;  in  this  case  Y  nh7k  <  Y  s(aa~“  n)1  =  0(s(n)'1)  because  we  have 

k-i  ’  h*-h+l 

assumed  that  s(n)  =  C^n*7).  Q.E.D. 

Since  we  have  also  assumed  that  M(n)  =  n",  Lemma  4.6  implies  that 

N„ 

Y  M(nhk)  =  0(M(s(n))).  By  slowing  the  parallel  computation  down  by  a  con- 
k-i 

stant  factor,  we  can  reduce  the  processor  bounds  to  M(s(n))  thus  arriving  at 

Lemma  4.7.  Ah+I  =  Zh  -  YhX1^,Y|1  can  be  computed  in  time  0(log  s(n))2 
using  M(s(n))  processors. 

Since  there  are  only  d=0(log  n)  stages,  each  taking  0(log  s(n))2  time  and 
using  M(s(n))  processors,  we  conclude  that  0(log  n(log  s(n))2)  is  the  total  time 
required  in  order  to  numerically  compute  the  recursive  s(n)-factorization  of  A 
using  M(s(n))  processors. 

Thereafter,  we  can  solve  Ax=b  for  any  given  column-vector  b  of  length  n  by 
computing  x=A_,b  by  recursive  application  of  equation  (4.2)  for  h=0,l,...,d-l. 
The  stage  h  of  this  “backsolving”  computation  requires  parallel  multiplication  of 
a  matrix  of  size  nh  k  X  nhk  times  a  column-vector  for  each  k=l,2,...,Nh.  Thus 

Nk  * 

stage  h  can  be  done  in  time  0(log  s(n))  using  Y  nh,k  processors.  By  Lemma  4.6, 
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this  bounds  the  number  of  processors  to  0(s(n)2),  and  a  slowdown  by  a  constant 
factor  reduces  this  processor  bound  to  s(n)2.  The  total  time  for  all  d=0(log  n) 
stages  of  such  backsolving  computation  is  0(log  n  log  s  (n)).  This  completes  the 
proof  of  Theorem  4.1,  except  it  remains  to  justify  our  assumption  that  au9  <  1/2 
by  proving  Lemma  4.1. 

Proof  of  Lemma  4.1.  Let  us  first  assume  that  a*  >  1/2.  Given  any 
graph  G  £  C,  where  G  =  (V,E)  and  n  =  |  V  |  >  n0i  we  can  partition  V  into 
disjoint  sets  V;,V2*,S*  such  that  |  |  <  o*n,  |  V2*  |  <  a*n,  |  S*  |  <  0(s(n)) 
and  Vj*  and  V2*  are  not  connected  by  E.  That  partition  can  simply  be  found  by 
appropriately  combining  the  Nh  <  (a*  -  l/2)l/log2a  =  0(1)  vertex  subsets 
found  at  depth  h  =  d-log((a*  -  l/2)/log  a)  of  TG.  The  separator  Sh*  is  defined  to 
be  the  union  of  all  separators  of  depths  >  h  in  TG,  and  since  there  is  only  a  con¬ 
stant  number  of  such  separators,  |  Sh*  |  <  0(s(n)).  Then  Vj*  can  be  any  maximal 
collection  of  the  vertex  subsets  of  depth  h  in  TG  where  n/2  <  |  V\*  |  <  a*n. 
Letting  V2  =  V  -  V,*  -  Sh*,  we  have  |  V2  (  <  n-(n/2)  <  n/2  <  o*n,  so  C  is 
s*(n)-separable  with  respect  to  a*  where  s*(n)  =  0(s(  i))  provided  that  a*  >  1/2. 
This  already  justifies  the  assumption  that  au9  <  1/2  if  o  >  1/w,  which  is  actu¬ 
ally  sufficient  for  our  purpose.  The  extension  to  the  case  a*  =  1/2  and  the 
decrease  of  the  constant  hidden  in  0(s(n))  to  l/(l-o<7))  are  obtained  similarly  to 
the  proof  of  Corollary  3  of  [Lipton  and  Tarjan,  79).  Q.E.D. 


Figure  5.  The  graph  G4  derived  from  Gs  by  simultaneous 
elimination  of  R3  =  {37,38,... ,42}. 
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Appendix  A.  Asymptotic  Parallel  Complexity  of  Matrix  Multiplication 

Theorem  A.l.  Suppose  that  the  product  of  two  NxN  matrices  can  be  com¬ 
puted  in  0(NW)  arithmetic  operations  for  u  >  2.  Then  such  a  product  can  be 
computed  in  parallel  time  0(log  N)  involving  at  most  Nw  processors. 

Proof.  We  recall  the  well  known  class  of  bilinear  algorithms  of  rank  M  for 
the  evaluation  of  an  nXn  matrix  product  XY  =  (Ej  X;j  yjk).  Such  algorithms  first 
compute  the  2M  linear  functions 

L,  =  £  f(U»q)*ij»  K  =  £  nj,M)yjk.  q=i,-,M,  (A.i) 

«.j  i> 

then  the  M  products  LqLq,  and  finally  Ej  X;j  yjk  as  the  linear  functions  in  those 
products, 

S  *ij  y,k  =  s'  f"(k,i,q)  L,L,V  (A.2) 

i  q— o 

Here  f(i,j,q),  f*(j,k,q)  and  f*(k,i,q)  are  constants.  Such  a  bilinear  algorithm  can 
be  applied  recursively,  substituting  nXn  matrices  for  the  variables  Xjj  and  yjk  in 
(A.l),  (A.2)  and  applying  the  same  algorithm  in  order  to  multiply  Lq  and  Lq, 
which  become  nXn  matrices  in  that  case.  (Such  recursive  bilinear  algorithms 
compute  nhXnh  matrix  product  involving  0{Mh)  arithmetic  operations.)  It  is  well 
known,  see  [Strassen,  73),  [Coppersmith  and  Winograd,  82],  that  for  large  n  there 
exist  the  above  recursive  bilinear  algorithms  of  rank  M  <  nw  that  evaluate  an 
nXn  matrix  product,  provided  that  the  assumption  of  Theorem  A.l  holds. 

Let  us  estimate  the  parallel  complexity  of  such  recursive  bilinear  algorithms 
for  nhXnh  matrix  multiplication,  that  is,  the  time  T(h)  and  the  number  of  pro¬ 
cessors  P(h)  involved  in  such  algorithms. 

(A.l), (A.2)  immediately  imply  that  simultaneously 

T(h)  <  T(h-l)  +  2  log  n  +  log  M  +  5, 

P(h)  <  max{2n2h,  Mn2h‘2,  P(h-l)M). 
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Choosing  h  — *  oo  we  immediately  deduce  the  desired  estimates  of  Theorem  A.l  in 
the  case  where  N  is  a  power  of  n.  The  extension  to  arbitrary  NxN  matrices  is 
also  immediate  via  embedding  those  matrices  into  nhXnh  matrices  banded  with 
zeros  for  h  =  flog  N/log  ri| . 

Appendix  B.  Approximate  Inverses  in  Some  Special  Cases. 

In  this  section  we  will  improve  the  choice  (2.6)  of  an  approximate  inverse  of 
A  for  certain  classes  of  matrices  A. 

Modification  B.l,  compare  (Isaacson  and  Keller,  66},  p.  84.  Let  X|,X2,...,Xn 
be  the  eigenvalues  of  AHA  such  that 


X,  >  X2  > 

(As  follows  from  Lemmas  3.3  and  3.4, 


>  K  >  o. 


*1  =  l|A||2,  X„  =  1/||A-'||2.)  (B.l) 

If  for  a  given  matrix  A  we  may  precompute  Xt  and  Xn,  then  we  may  improve 

the  choice  (2.6)  for  the  approximate  inverse  B  of  A  by  setting 

B  =  t*AH,  t*  =  2/(X,  +  X„).  (B.2) 

(B.l)  and  (B.2)  imply  that  ||R(B)||  =  (X,  -  Xn)/(Xj  +  Xn),  and  this  leads  to  the 
following  equations,  which  improve  over  the  bounds  of  Lemma  2.4, 

q  =  ||R(B)||  =  ^  =  1  -  *  .  •  (B.3) 

(cond  A)M-1  (cond  A)N- 1 

(Indeed,  ||R(B)||  =  p(R(B)),  see  Lemma  3.1,  and  the  eigenvalues  of  R(B)  are  just 
Pi  =  1— t*Xj,  i=l,...,n,  see  the  proof  of  Lemma  3.5,  so  that  l-t*X|  <  fi,  <  l-t*Xn, 
i=l,...,n.  It  remains  to  recall  (B.l), (B.2)  in  order  to  arrive  at  (B.3).) 

The  improvement  over  the  estimate  of  Lemma  2.4  is  limited,  however,  that 
is,  1-||R(B)||  increases  less  than  2n  times  even  if  we  may  precompute  Xj  and  X0 
satisfying  (B.l)  and  then  replace  (2.6)  by  (B.2). 

Modification  B.2.  If  the  matrix  A  is  Hermitian  positive  definite,  (see 


Definition  3.1),  which  is  frequent  in  practical  computations,  see  [Golub  and  van 
Loan,  83],  [Hageman  and  Young,  81],  [Young,  71],  then  we  may  choose 

B  =  t'  I,  t'  =  1/max  £  |  ajj  |  <  1/||A[|  <  1/max  |  a^  |  . 

j  i  >j 

Extending  the  proof  of  Lemma  2.4,  we  derive  in  this  case  that 

||R(B)||  <  1-t'  /||A-»||  <  1  -  V(n,/2  cond  A).  (B.4) 

Remark  B.l.  Actually  we  may  reduce  the  inversion  of  an  arbitrary  non¬ 
singular  matrix  A  to  the  inversion  of  the  Hermitian  positive  definite  matrix  AHA 
since  A"1  =  (AHA)-,AH.  Then  we  will  arrive  at  the  bound  (B.4)  (where  AHA 
replaces  A)  if  we  compute  the  1-norm  ]|AHA|]j  of  the  matrix  AHA  and  choose 
B  =  I/||AhA||„  see  Definition  2.3.  The  resulting  estimate  (B.4)  will  be  close  to 
(2.8)  and  (2.10). 

Modifications  B.l  and  B.2  can  be  combined  together  if  A  is  a  Hermitian  posi¬ 
tive  definite  matrix  and  if  ||A|]  and  1/||A*,||  can  be  precomputed,  in  which  case 

setting  B  =  2l/(||A||  +  1/||A_,||)  we  would  yield  ||R(B)||  =  1 - - . 

cond  A+l 

Modification  B.3.  Case  1.  Let  A  =  (a^)  be  strongly  diagonally  dominant, 
that  is,  let  for  a  constant  c 

(2— l/nc)  |  ajil  >  X)  I  84j  !  for  all  i  (B.5) 

j 

or 

(2-l/ne)  J  &jj  |  >  £  |  ajj  |  for  all  j.  (B.6) 

i 

The  class  of  diagonally  dominant  matrices  is  very  important  in  applications,  see 
[Varga,  62]  and  [Hageman  and  Young,  81].  (B.5)  implies  that  A  is  strongly  row - 
diagonally  dominant;  (B.6)  implies  that  A  is  strongly  column-diagonally  dominant. 
In  both  cases  we  choose 


B  =  diagfl/an.l/a^,  •  •  *  ,1/a^}. 


(B.7) 


Lemma  B.l,  compare  Definition  2.3.  ( B.5 )  and  (B.  7)  imply  that 

l|R(B)H„  <  1  -  I/O'.  (B.8) 

(B.6)  and  (B.7)  imply  that 

l|R(B)||i  <  1  -  l/nc.  (B.9) 

Proof.  For  all  i  the  row-vector  i  of  the  matrix  R(B)=I-BA  has  the  1-norm 

equal  to  |  a;j  |  /  |  a;i  |  which  is  not  greater  than  l-l/ne  if  (B.5)  holds.  This 

i*i 

immediately  implies  (B.8).  (B.6)  and  (B.9)  turn  into  (B.5)  and  (B.8)  if  A  replaces 

Ah. 

Thus  in  the  cases  where  A  is  strongly  diagonally  dominant  we  choose  B 
defined  by  (B.7)  and  again  arrive  at  the  bounds  of  Corollary  2.1  provided  that 
one  of  the  two  norm  of  Definition  2.3  substitutes  for  the  2-norm.  The  computa¬ 
tion  of  B  by  (B.7)  involves  only  n  arithmetic  operations  that  can  be  performed 
using  1  parallel  step  and  n  processors. 

Modification  B.3.  Case  2.  Let  A  =  (a^)  be  a  triangular  matrix.  Then 
define  B  by  (B.7)  and  apply  the  algorithm  of  Section  2  for  computing  Bk*  for 
k  =  pog  n] .  Although  B  does  not  satisfy  (2.2)  in  this  case,  the  method  works 
simply  because  R(B)  is  an  nXn  triangular  matrix  whose  diagonal  is  filled  with 

zeros  and  therefore  R(B)1  =  0  if  i>n.  Thus  A~*  =  B^‘  =  £)(R(B))'B,  so  that 

i-o 

the  convergence  of  the  algorithm  to  A-1  in  pog  nj  steps  immediately  follows  for 
any  triangular  nonsingular  matrix  A  even  if  cond  A  is  exponentially  large,  com¬ 
pare  (Csanky,  76)  and  [Heller,  73).  (Note  that  the  inversion  of  a  matrix  A  can  be 
reduced  to  the  inversion  of  triangular  marices  if  the  QR-factors  of  A  or  LUP- 
factors  of  A  are  available.) 

Appendix  C.  Refinement  of  the  Approximate  Inverse  of  a  Matrix  by 
Newton’s  Iteration  and  by  the  Residual  Correction  Method. 


Next  we  will  consider  two  alternatives  to  the  first  iterative  algorithm  for 
INVERT  described  in  Section  2.  As  in  the  case  of  the  first  algorithm,  we  will 
assume  that  an  approximate  inverse  B  of  A  has  been  precomputed  such  that  (2.2) 
holds. 

The  second  algorithm  performs  Newton’s  iteration  for  the  matrix  equation 
R(X)  =  0  by  successively  computing 

Bt,  =  (21  -  Bh.,A)Bh_1  =  (I  +  R(Bh_,))Bh_,  (C.I) 

for  h=l,2,...  and  letting  B0  =  B.  It  is  readily  verified  that  R(Bh)  =  (R(B|,_,))2  for 
h=l,2,...,  so  that  I  -  BfcA  =  R(Bk)  =  (R(Bo))2t  for  all  k.  We  postmultiply  the 
latter  equations  by  A-1,  note  that  (2.2)  and  Lemma  2.1  imply  that 


I|A-‘II  <  l|B||/(l-q), 

(C.2) 

and  arrive  at  the  bound 

l|A-'-Bk||  <  q^llBHAi-q), 

(C.3) 

compare  (2.3). 

The  third  algorithm  relies  on  the  well  known  residual  correction  method,  see 
[Atkinson,  78],  p.  469.  We  successively  compute 


Xb+llg  =  Xh,g  +  Bg(I  -  AXhig),  Xo,g  =  Bg  for  h=0,l,...,s-l,s>2  (C.4) 

where  g=0  and  B0  =  B  satisfies  (2.2).  Then  we  recursively  perform  the  iteration 
sweep  (C.4)  for  g=l,2,...,  choosing  Bg  =  X,  g.  Within  each  iteration  sweep  (C.4) 
I  -  Xh+I  gA  =  (I  -  Xh  rA)(I  -  BjA),  h=0,l,...,s-l,  see  [Atkinson,  78],  so  that 
||I  -  Bg+1A||  <  ||I  -  BjAll8.  Therefore  after  k  iteration  sweeps  (C.4)  we  arrive  at 
an  approximation  B^  to  A-1  such  that  ||I  -  B^A] j  <  q*4.  We  postmultiply  I  -  B^A 
by  A-1,  recall  (C.2)  and  deduce  that 

HA"1  -  BfcH  <  qBk!|Bl!/(l-q).  (C.5) 

For  s=2,  (C.5)  is  similar  to  (2.3)  and  (C.3). 

Unlike  the  algorithms  of  Section  2,  both  algorithms  of  this  appendix  are 


self-correcting;  the  errors  of  computing  and  Xhg  for  any  k,h,g  do  not  pro¬ 
pagate,  that  is,  they  are  automatically  corrected  at  the  next  iterations  provided, 
of  course,  that  the  iterations  are  not  too  much  contaminated  by  the  errors  (it  is 
necessary  that  in  spite  of  the  errors  the  computed  matrices  and  Xj,g  remain  to 
be  approximate  inverses  of  A),  see  the  next  appendix. 


Appendix  D.  Some  Error  Estimates. 

Let  us  assume  that  the  computation  of  B  by  (2.6)  and  the  subsequent 
Newton’s  iteration  (C.l)  have  been  performed  with  finite  precision  chopping  to  d 
binary  digits  and  let  us  estimate  the  total  round-off  errors  of  the  computation. 

Let  A(u)  and  6  (u)  =  designate  the  absolute  and  the  relative  errors  of 

INI 

computing  u  where  u  can  be  a  matrix  or  a  scalar  (in  the  latter  case 
||u||  =  (  u  |  ,  ||Au||  =  |  Au  |  ).  As  this  is  customary,  we  will  assume  that 
d  -*  oo  and  will  ignore  the  values  of  smaller  orders  of  magnitudes. 

We  immediately  derive  from  (2.6)  that  6  (1/t)  <  21_d,  6  (B)  <  3  ■  2~d. 

Since  ||B||  <  1/||A||,  see  (2.8),  it  follows  that 

||I-(B  +  AB)A||  <  ||I-BA||  +  3  •  2'd  <  1+3  •  2"d-l/(n(cond  A)2). 

Therefore  it  is  sufficient  to  choose  d  =  0(log  n)  in  order  to  assure  that,  say 

||I-(B+  AB)A||  <  l-0.8/nc  (D.l) 

provided  that  ||I-BA||  <  l-l/nc.  Therefore  if  we  compute  B  by  (2.6),  it  is 
sufficient  to  choose  d  =  0(log  n)  in  order  to  assure  that  the  computed  approxi¬ 
mation  matrix  B  +  AB  will  remain  practically  as  good  as  the  approximate 
inverse  B,  that  is,  the  total  number  of  iterations  sufficient  for  computing  A-1  with 
required  precision  may  increase  at  most  by  1  if  B  +  AB  substitutes  for  B.  (Note 
that  the  first  iteration  squares  the  residual  norm  and  turn  l-0.6/ne  into 
l-1.2/ne+0.38/n2e,  which  is  less  than  l-l/ne  for  large  n,  compare  (D.l).) 


Next  let  us  assume  that  B^i  and  A  are  given  and  that  Bb  has  been  com* 
puted  by  (C.l)  using  d  binary  digits.  If  the  straightforward  algorithm  for  matrix 
multiplication  is  used,  then  the  error  bounds  |fA(Bb_1A)||  and  ||A(Bh)||  can  be 
immediately  estimated  applying  the  backward  error  analysis,  see  [Wilkinson,  65]. 
The  error  analysis  is  not  too  hard  even  if  the  matrix  multiplications  in  (C.l)  are 
performed  by  asymptotically  fast  algorithms.  In  [Pan,  84),  pp.  117-127,  the  error 
analysis  is  presented  for  such  fast  sequential  algorithms;  the  analysis  does  not 
change  in  the  case  of  parallel  computation.  In  particular,  Corollary  23.4,  p.  121, 
and  Theorem  24.2,  p.  127,  of  [Pan,  84]  imply  that 

l|A(VlA)ll  <  .  ||A||.  (D.2) 

Let  ||I— B^AJI  =  q,  so  that  HI-BgAH  <  q2*  for  ail  g.  Then  (C.l)  implies  that 

lliyi  <  (l+l|I-Bh-,A||)|fBfc_1||  <  (I+q2,)||Bh.,H  <  (l+q+qJ+  •  •  •  +qh)||B0||, 
compare  Lemma  2.2.  Therefore 

HBhll  *  !!A]|  <  (l/(l-q))||Boll  *  IIAII  <  l/(l-q), 
since  B  =  Bq  is  defined  by  (2.6),  compare  (2.8).  Substitute  this  into  (D.2)  and 

deduce  that 

||A(Bh_,A)||  <  2-dn°(1V(l-q)  <  2-dn°W 

if  1-q  >  l/n0^. 

Similarly  we  extend  this  bound  to  the  following  estimate. 

IIAfBbHI  <  2-V>  «*>||l+R(Bh_1)||  .  lfBb.,11  <  ^Ol'VlIAH 

so  that 

IIHBh  +  A(Bh))A||  <  |[I-BhA||  + 

If,  say  ||I-Bj,A]J  <  2_nM,  then  it  is  sufficient  to  choose  d=ne+0(log  n)  in  order 
to  assure  that  ||I-(Bj,  +  A(Bi,))A||  <  2-n‘.  Then  again  Bh  +  A(Bh)  serves  practi¬ 
cally  as  well  as  Bj,  in  the  subsequent  iterations  (C.l),  (that  is,  1-2  extra  iterations 
shall  guarantee  at  least  the  original  precision  of  the  output  if  Bb  +  A(Bb)  replaces 


